Use helpers in your recordExtractor to make it easier to extract relevant content from your page.

Algolia has a selection of helpers:

  • product
  • article
  • page
  • splitContentIntoRecords
  • codeSnippets
  • docsearch

product

This helper extracts content from product pages. A “product page” is an HTML page with a JSON-LD Product schema.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.product({ url, $ });
}

Response

The helper returns an object with the following properties:

objectID
string

The product page’s URL.

url
string

The product page’s URL (without parameters or hashes).

lang?
string

The language the page content is written in (from the HTML lang attribute).

name
string

The name field of the JSON-LD product schema.

sku
string

The sku field of the JSON-LD schema.

description?
string

The description field of the JSON-LD schema.

image?
string

The image field of the JSON-LD schema.

price?
string

The product’s price, selected from the first of these JSON-LD schema fields that is present, in this order:

  1. offers.price
  2. offers.highPrice
  3. offers.lowPrice

currency?
string

The offers.priceCurrency field of the JSON-LD schema.

category?
string

The category field of the JSON-LD schema.
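
If the helper’s output doesn’t cover everything you need, you can add your own attributes on top of it. A minimal sketch, assuming a hypothetical .breadcrumb list in your product page markup:

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const product = helpers.product({ url, $ });
  // Hypothetical: collect breadcrumb text, assuming a .breadcrumb list in your markup
  const breadcrumbs = $('.breadcrumb li')
    .map((i, el) => $(el).text().trim())
    .get();
  return { ...product, breadcrumbs };
}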

article

This helper extracts content from article pages. An “article page” is an HTML page with an article JSON-LD schema or an equivalent meta tag.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.article({ url, $ });
}

Response

The helper returns an object with the following properties:

objectID
string

The article’s URL.

url
string

The article’s URL (without parameters or hashes).

lang?
string

The language the article is written in (from the HTML lang attribute).

headline
string

The article’s headline, selected from the first of these that is present, in this order:

  1. meta[property="og:title"]
  2. meta[name="twitter:title"]
  3. head > title
  4. The first <h1>

description?
string

The article’s description, selected from the first of these that is present, in this order:

  1. meta[name="description"]
  2. meta[property="og:description"]
  3. meta[name="twitter:description"]

keywords
string array

The keywords field of the JSON-LD schema.

tags
string array

Article tags: meta[property="article:tag"].

image?
string

The image associated with the article, selected from the first of these that is present, in this order:

  1. meta[property="og:image"]
  2. meta[name="twitter:image"]

authors?
string array

The author field of the JSON-LD schema.

datePublished?
string

The datePublished field of the JSON-LD schema.

dateModified?
string

The dateModified field of the JSON-LD schema.

category?
string

The category field of the JSON-LD schema.

content
string

The article’s content (body copy).
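
You can also post-process the helper’s output before returning it. A minimal sketch that skips pages without a headline (the filter itself is an assumption about what you want to index):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const article = helpers.article({ url, $ });
  // Hypothetical filter: returning an empty array creates no records for this page
  if (!article.headline) {
    return [];
  }
  return article;
}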

page

This helper extracts text from any page, regardless of its type or category.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      content: 'body',
    },
  });
}

Response

The helper returns an object with the following properties:

objectID
string

The object’s unique identifier.

url
string

The page’s URL.

hostname
string

The URL hostname (for example, example.com).

path
string

The URL path: everything after the hostname.

depth
number

The URL depth, based on the number of slashes after the domain. For example, http://example.com/ = 1, http://example.com/about = 1, http://example.com/about/ = 2.

fileType
file type

The page’s file type. One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, or email.

contentLength
number

The page length in bytes.

title?
string

The page title, derived from head > title.

description?
string

The page’s description, derived from meta[name="description"].

keywords?
string array

The page’s keywords, derived from meta[name="keywords"].

image?
string

The image associated with the page, derived from meta[property="og:image"].

headers?
string array

The page’s section titles, derived from h1 and h2.

content
string

The page’s content (body copy).
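
The recordProps selectors control where the title and content come from, so you can narrow them to leave out navigation and footer text. A minimal sketch, assuming your pages wrap their body copy in a main element:

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      // Restrict the extracted body copy to the main element instead of the whole body
      // (the selector is an assumption about your markup)
      content: 'main',
    },
  });
}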

splitContentIntoRecords

This helper extracts text from long HTML pages and splits them into smaller chunks. This can help prevent “Record too big” errors.

Using this example record extractor on a long page returns an array of records, each one smaller than 1,000 bytes.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // Produced records can be modified after creation, if necessary.
  return records;
}

When splitting pages, the same words can appear in several records from the same page. If you don’t want these duplicates to show up when users search:

  • Set distinct to true in your index. For example, distinct: true.
  • Set attributeForDistinct to your page’s URL. For example, attributeForDistinct: 'url'.
  • Set searchableAttributes to your page title and body content. For example, searchableAttributes: [ 'title', 'text' ].
  • Add a customRanking to sort from the first split record on your page to the last. For example, customRanking: [ 'asc(part)' ].

JavaScript
initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Parameters

Specify one or more of these parameters in your helper call to control how the records are split.

baseRecord
record
default:"{}"

Takes this record’s attributes (and values) and adds them to all the split records.

$elements
string
default:"$('body')"

A Cheerio selector that determines from which elements content will be extracted. For more information, see Extracting data with Cheerio.

maxRecordBytes
number
default:"10000"

Maximum number of bytes allowed per record. To avoid errors, check your plan’s record size limits.

orderingAttributeName
string

This attribute stores the sequentially generated number assigned to each record when the helper splits a page.

textAttributeName
string
default:"text"

Name of the attribute in which to store the text of each split record.
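
Since the helper returns an array of records, you can enrich each one after it’s created. A minimal sketch that copies the page language onto every split record (the lang attribute lookup and the 'en' fallback are assumptions):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const records = helpers.splitContentIntoRecords({
    baseRecord: { url, title: $('head title').text().trim() },
    $elements: $('body'),
    maxRecordBytes: 10000,
  });
  // Copy the page-level language onto every split record (the 'en' fallback is arbitrary)
  const lang = $('html').attr('lang') || 'en';
  return records.map((record) => ({ ...record, lang }));
}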

codeSnippets

Use this helper to extract code snippets from pages. The helper finds code snippets by looking for <pre> tags and extracting the content and the language class prefix from the tag.

If the crawler finds several code snippets on a page, the helper returns a list of those snippets.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  // These option values are illustrative; adjust them to match your markup
  const code = helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' });
  return { code };
}

Response

The helper returns an array of code objects with the following properties:

content
string

The code snippet.

languageClassPrefix?
string

The code snippet’s language (if found).

codeUrl?
string

The URL of the nearest sibling <a> tag.

fragmentUrl?
string

A text fragment URL for the code snippet. A text fragment links directly to a selection of text within a page.
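
Because this helper only returns the snippets, you can attach them to a record built by another helper. A minimal sketch combining it with the page helper (the option values passed to codeSnippets are illustrative):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const page = helpers.page({
    url,
    $,
    recordProps: { title: 'head title', content: 'body' },
  });
  // Option values are illustrative; adjust them to your markup
  const code = helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' });
  return { ...page, code };
}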

docsearch

This helper extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy.

You can also use it without DocSearch or to index non-documentation content. For more information, see the DocSearch documentation.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}
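
You can also branch on the page URL before calling the helper. A minimal sketch that skips a hypothetical /changelog/ section (the path check and the trimmed-down recordProps are examples, not DocSearch requirements):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  // Hypothetical filter: skip a /changelog/ section; an empty array creates no records
  if (String(url).includes('/changelog/')) {
    return [];
  }
  return helpers.docsearch({
    aggregateContent: true,
    recordProps: {
      lvl0: { selectors: "header h1" },
      lvl1: "article h2",
      content: "main p, main li",
    },
  });
}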